12/11/2018

Contents

1. Background

2. Question design and data source

3. Dataset overview

4. Data cleansing

5. EDA

6. Logistic regression model

7. Evaluation

8. Conclusion

Background

  • Titanic is a famous movie released in 1997.
  • It is a romantic and sad love story.

Background

  • Behind the movie is a real event: the Titanic sank in 1912 after colliding with an iceberg.
  • It was a major disaster: more than 1,500 of the estimated 2,224 people on board died.

Question design and data source

  • Question
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.

Some groups of people were more likely to survive than others, so our question is: which factors are associated with whether a passenger survived, based on the information recorded about them?

  • Data source
Our data comes from the Kaggle competition "Titanic: Machine Learning from Disaster":
https://www.kaggle.com/c/titanic

Dataset overview

## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
\(Survived\)
0 = No, 1 = Yes
\(Pclass\)
Ticket class: 1st = upper, 2nd = middle, 3rd = lower

Dataset overview

\(SibSp\)
Number of siblings / spouses aboard the Titanic
\(Parch\)
Number of parents / children aboard the Titanic
\(Ticket\)
Ticket number
\(Fare\)
Passenger fare
\(Cabin\)
Cabin number
\(Embarked\)
Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

Data cleansing

We first check each variable for missing (NA) values.
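The check can be sketched as follows. This is a minimal illustration in Python rather than the R used for the slides; the rows are a few illustrative stand-ins for the Kaggle data, and the helper name `count_missing` is hypothetical. Empty strings are counted as missing because \(Cabin\) and \(Embarked\) show a blank level in the overview.

```python
# Illustrative stand-in rows; None or "" marks a missing value.
rows = [
    {"Age": 22.0, "Cabin": "", "Embarked": "S"},
    {"Age": 38.0, "Cabin": "C85", "Embarked": "C"},
    {"Age": None, "Cabin": "", "Embarked": "Q"},
    {"Age": 35.0, "Cabin": "C123", "Embarked": ""},
]


def count_missing(rows):
    """Return {column: number of rows whose value is None or an empty string}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                counts[column] = counts.get(column, 0) + 1
            else:
                # Make sure columns with no missing values still appear.
                counts.setdefault(column, 0)
    return counts


missing = count_missing(rows)
print(missing)
```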

Data cleansing

  • We replace the missing values in \(Embarked\) with the most frequent value, "S".

  • We replace the NA values in \(Age\) with the mean of \(Age\).

  • We drop \(Cabin\) because a large share of its values is missing and it is the least important variable.
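The three cleaning steps above could look roughly like this. It is a Python sketch on made-up rows, not the code behind the slides; the function name `impute` and the sample values are assumptions for illustration.

```python
from collections import Counter
from statistics import mean

# Made-up rows; None or "" marks a missing value.
rows = [
    {"Age": 22.0, "Embarked": "S", "Cabin": ""},
    {"Age": None, "Embarked": "", "Cabin": "C85"},
    {"Age": 30.0, "Embarked": "S", "Cabin": ""},
    {"Age": 38.0, "Embarked": "C", "Cabin": ""},
]


def impute(rows):
    """Fill Age with the mean, Embarked with the mode, and drop Cabin."""
    known_ages = [r["Age"] for r in rows if r["Age"] is not None]
    known_ports = [r["Embarked"] for r in rows if r["Embarked"]]
    mean_age = mean(known_ages)
    top_port = Counter(known_ports).most_common(1)[0][0]

    cleaned = []
    for r in rows:
        cleaned.append({
            "Age": mean_age if r["Age"] is None else r["Age"],
            "Embarked": r["Embarked"] or top_port,
            # Cabin is intentionally not copied over.
        })
    return cleaned


cleaned = impute(rows)
print(cleaned[1])
```

Mean imputation keeps every row but shrinks the variance of \(Age\), which is worth keeping in mind when reading the age-related coefficients.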

Data cleansing

  • Next, we combine some of the variables.

    \(SibSp\) and \(Parch\) both count related family members on board the Titanic,

    so we add them together to get each passenger's total number of family members on board.

Data cleansing

  • Based on the number of family members,
    we construct a new factor variable FamilySize with the levels
    'Single', 'Small', 'Big'.
  • Classification criteria

    Single: family members = 0
    Small:  family members = 1 or 2
    Big:    family members > 2
##    Big Single  Small 
##     91    537    263
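The rule above can be written as a small function. This is a Python sketch (the slides themselves use R), and the function name `family_size` is hypothetical; the family count is the sum of \(SibSp\) and \(Parch\).

```python
def family_size(sibsp, parch):
    """Map SibSp + Parch to the FamilySize level used in the slides."""
    members = sibsp + parch
    if members == 0:
        return "Single"
    if members <= 2:
        return "Small"
    return "Big"


# A passenger with one sibling or spouse and no parents or children aboard.
print(family_size(1, 0))
```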

Data cleansing

  • We classify age into the AgeGroup levels
    'Child', 'Juvenile', 'Youth', 'MiddleAge', 'Senium'.
  • Classification criteria

    Child:     age <= 6
    Juvenile:  6 < age <= 17
    Youth:     17 < age <= 40
    MiddleAge: 40 < age <= 65
    Senium:    age > 65
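As a sketch, the binning can be expressed as a chain of upper bounds. The Python function below is illustrative (the name `age_group` is hypothetical) and treats every age above 40 and up to 65 as MiddleAge, which is an assumption about the exact boundary.

```python
def age_group(age):
    """Map a numeric age to the AgeGroup level used in the slides."""
    if age <= 6:
        return "Child"
    if age <= 17:
        return "Juvenile"
    if age <= 40:
        return "Youth"
    if age <= 65:  # assumption: everything above 40 and up to 65 is MiddleAge
        return "MiddleAge"
    return "Senium"


print([age_group(a) for a in (2, 14, 22, 54, 70)])
```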

EDA

Correlation between variables

EDA

Histogram by group

Logistic regression model

  • We select \(Survived\) as the dependent variable.

  • Because \(Survived\) is binary, we use logistic regression.

Logistic regression model

Initial model

## 
## Call:
## glm(formula = Survived ~ Sex + AgeGroup + FamilySize + Pclass + 
##     Embarked, family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6809  -0.6706  -0.4057   0.6120   2.4845  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.00043    0.54265   7.372 1.68e-13 ***
## Sexmale           -2.78512    0.20785 -13.400  < 2e-16 ***
## AgeGroupJuvenile  -2.07572    0.55559  -3.736 0.000187 ***
## AgeGroupMiddleAge -3.10495    0.51992  -5.972 2.34e-09 ***
## AgeGroupSenium    -3.84666    1.20691  -3.187 0.001437 ** 
## AgeGroupYouth     -2.52134    0.46818  -5.385 7.23e-08 ***
## FamilySizeSingle   1.44313    0.35590   4.055 5.02e-05 ***
## FamilySizeSmall    1.52340    0.35129   4.337 1.45e-05 ***
## Pclass2           -0.98748    0.26969  -3.662 0.000251 ***
## Pclass3           -2.15825    0.25233  -8.553  < 2e-16 ***
## EmbarkedQ         -0.04071    0.38457  -0.106 0.915685    
## EmbarkedS         -0.43479    0.24216  -1.796 0.072571 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  764.59  on 879  degrees of freedom
## AIC: 788.59
## 
## Number of Fisher Scoring iterations: 5
##       (Intercept)           Sexmale  AgeGroupJuvenile AgeGroupMiddleAge 
##       54.62162288        0.06172143        0.12546658        0.04482697 
##    AgeGroupSenium     AgeGroupYouth  FamilySizeSingle   FamilySizeSmall 
##        0.02135101        0.08035183        4.23394352        4.58780627 
##           Pclass2           Pclass3         EmbarkedQ         EmbarkedS 
##        0.37251559        0.11552692        0.96010289        0.64739747
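The unlabeled block of numbers after the model summary matches the exponentiated coefficients, that is, odds ratios relative to the reference levels. A short Python check on a few of the estimates printed above (the dictionary name `estimates` is just for illustration):

```python
import math

# A few estimates copied from the initial model summary.
estimates = {
    "(Intercept)": 4.00043,
    "Sexmale": -2.78512,
    "Pclass3": -2.15825,
    "EmbarkedS": -0.43479,
}

# exp(coefficient) gives the multiplicative change in the odds of survival.
odds_ratios = {name: math.exp(beta) for name, beta in estimates.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:12s} {ratio:.5f}")
```

Read this way, an odds ratio below 1 (for example for Sexmale) means lower odds of survival than the reference level, holding the other variables fixed.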

Logistic regression model

Final model

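  • We separate the train dataset into

    subtrain dataset (80%)

    subvalidation dataset (20%)

  • independent variables
    \(Pclass\), \(Sex\), \(AgeGroup\), \(FamilySize\)
    (\(Embarked\) is left out; neither of its coefficients was significant at the 5% level in the initial model)
  • dependent variable
    \(Survived\)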
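An 80/20 split of this kind can be sketched as a shuffle followed by a cut. The Python below is one possible way to do it, not the procedure used for the slides; the function name, the seed and the rounding of the cut point are all assumptions.

```python
import random


def split_train_validation(rows, train_fraction=0.8, seed=42):
    """Shuffle rows reproducibly and split them into (subtrain, subvalidation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row indices stand in for the 891 observations of the train dataset.
subtrain, subvalidation = split_train_validation(range(891))
print(len(subtrain), len(subvalidation))
```

Fixing the seed matters here because the fitted coefficients and the validation scores both change with the particular split.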

Logistic regression model

Final model

## 
## Call:
## glm(formula = Survived ~ Sex + AgeGroup + FamilySize + Pclass, 
##     family = binomial(link = "logit"), data = subtrain)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5330  -0.6091  -0.4544   0.5860   2.4386  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         3.1667     0.5613   5.642 1.68e-08 ***
## Sexmale            -2.7487     0.2244 -12.250  < 2e-16 ***
## AgeGroupJuvenile   -1.6740     0.6380  -2.624 0.008694 ** 
## AgeGroupMiddleAge  -2.9221     0.6084  -4.803 1.56e-06 ***
## AgeGroupSenium     -3.2328     1.2669  -2.552 0.010721 *  
## AgeGroupYouth      -2.2197     0.5537  -4.009 6.09e-05 ***
## FamilySizeSingle    1.7006     0.4160   4.088 4.35e-05 ***
## FamilySizeSmall     1.7828     0.4101   4.347 1.38e-05 ***
## Pclass2            -0.9727     0.2893  -3.362 0.000773 ***
## Pclass3            -2.1174     0.2670  -7.931 2.17e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 954.46  on 712  degrees of freedom
## Residual deviance: 632.72  on 703  degrees of freedom
## AIC: 652.72
## 
## Number of Fisher Scoring iterations: 5
##       (Intercept)           Sexmale  AgeGroupJuvenile AgeGroupMiddleAge 
##       23.72803425        0.06401285        0.18748912        0.05382059 
##    AgeGroupSenium     AgeGroupYouth  FamilySizeSingle   FamilySizeSmall 
##        0.03944646        0.10864363        5.47733010        5.94652228 
##           Pclass2           Pclass3 
##        0.37804731        0.12034388
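To see what the final model implies for a single passenger, the coefficients can be plugged into the logistic function. The Python sketch below copies the estimates from the summary above; the reference levels (female, Child, Big, class 1) are the ones absent from the coefficient table, and the function name `survival_probability` is hypothetical.

```python
import math

# Estimates copied from the final model summary.
COEF = {
    "(Intercept)": 3.1667,
    "Sexmale": -2.7487,
    "AgeGroupJuvenile": -1.6740,
    "AgeGroupMiddleAge": -2.9221,
    "AgeGroupSenium": -3.2328,
    "AgeGroupYouth": -2.2197,
    "FamilySizeSingle": 1.7006,
    "FamilySizeSmall": 1.7828,
    "Pclass2": -0.9727,
    "Pclass3": -2.1174,
}

# Levels absorbed into the intercept (no coefficient of their own).
REFERENCE = {"Sex": "female", "AgeGroup": "Child", "FamilySize": "Big", "Pclass": "1"}


def survival_probability(sex, age_group, family_size, pclass):
    """Predicted P(Survived = 1) from the fitted logit coefficients."""
    logit = COEF["(Intercept)"]
    levels = (("Sex", sex), ("AgeGroup", age_group),
              ("FamilySize", family_size), ("Pclass", str(pclass)))
    for variable, level in levels:
        if level != REFERENCE[variable]:
            logit += COEF[variable + level]
    return 1.0 / (1.0 + math.exp(-logit))


p_low = survival_probability("male", "Youth", "Single", 3)
p_high = survival_probability("female", "Child", "Small", 1)
print(round(p_low, 3), round(p_high, 3))
```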

Evaluation

  • Next, we use the remaining 20% of the train dataset to evaluate the model.
## [1] "Accuracy 0.837988826815642"

  • Area under the ROC curve (AUC):
## [1] 0.8818614
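Both scores can be computed directly from the true labels and the predicted probabilities. The Python sketch below uses made-up labels and probabilities, assumes a 0.5 classification threshold (the slides do not state the threshold), and computes AUC as the share of positive/negative pairs that the model ranks correctly.

```python
def accuracy(labels, probabilities, threshold=0.5):
    """Share of cases where the thresholded prediction equals the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    hits = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return hits / len(labels)


def auc(labels, probabilities):
    """Probability that a random positive scores above a random negative."""
    positives = [p for y, p in zip(labels, probabilities) if y == 1]
    negatives = [p for y, p in zip(labels, probabilities) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up validation labels and predicted probabilities.
labels = [1, 0, 1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]
print(accuracy(labels, probabilities), auc(labels, probabilities))
```

Unlike accuracy, AUC does not depend on the chosen threshold, which is why the two numbers are reported side by side.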

Conclusion

  • Our logistic model performs well on the validation data: AUC of 0.88 and 83.8% classification accuracy.
  • The ROC curve also looks good.
  • The variables \(Sex\), \(AgeGroup\), \(FamilySize\) and \(Pclass\) have a statistically significant relationship with \(Survived\).
  • Passengers who are female, children, in class 1, or in a small family are more likely to survive.

Conclusion

This is also what the movie Titanic suggests.

Conclusion

Thank you!